Linear Regression is easy to understand, unlike algorithms such as SVMs or deep learning.
It is a gateway to other algorithms; think of it as a foundational algorithm.
Basically, there are three types of Linear Regression:
Simple LR (only one input column and one output column), e.g., CGPA and IQ level; this data can be used to predict IQ from CGPA.
Multiple LR (more than one input column), e.g., car mileage, brand, fuel type, HP, and price_for_sale; the data still has only one output column, price_for_sale.
Polynomial LR (used when the data is not completely linear).
2 Simple Linear Regression
The first step in any ML algorithm is to plot the data.
Code
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pydataset import data
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from ipywidgets import interactive
from sklearn.datasets import make_regression
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

%matplotlib inline
init_notebook_mode(connected=True)
cf.go_offline()
plt.style.use('ggplot')

plt.scatter(x=range(1, 50), y=range(11, 60))
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.title("Completely Linear data")
plt.show()
What we do in LR is fit a best-fit line that passes through the points, using the equation \(y = mx + c\).
Code
plt.scatter(x=range(1, 50), y=range(11, 60), s=40, label="data points")
plt.plot(range(-1, 52), range(9, 62), label="Perfect fit line", color="#0f4c81")
plt.title("Perfect Fit line using LR")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.legend()
plt.show()
Now, let us look at some real-world data.
Code
import pandas as pd

data_placement = pd.read_csv("placement.csv")
data_placement.head(5)
   cgpa  package
0  6.89     3.26
1  5.12     1.98
2  7.82     3.25
3  7.42     3.67
4  6.94     3.57
Code
plt.scatter(x=data_placement["cgpa"], y=data_placement['package'])
plt.xlabel("CGPA")
plt.ylabel("Package in LPA")
plt.title("Placement Data")
plt.show()
Real-world data is not completely linear, only roughly so, because it contains noise. Still, we do the same task as before: fit a best-fit line instead of a perfect-fit line. The best-fit line is the line with minimum error on this data. What linear regression does is find the values of m and c in y = mx + c for which the line passes as closely as possible through all the data points. This line is called the best-fit line.
Code
plt.scatter(x=data_placement["cgpa"], y=data_placement['package'], label="data points")
plt.plot(data_placement["cgpa"], (0.5696 * data_placement['cgpa'] - 0.9857), color='darkgreen', label="Best Fit Line")
plt.xlabel("CGPA")
plt.ylabel("Package in LPA")
plt.title("Placement Data")
plt.legend()
plt.show()
2.1 Model Building and prediction
Let us see how to build such a model in Python.
Training input
Code
x = data_placement.iloc[:, 0:1]
y = data_placement.iloc[:, 1:]
x.head(5)
   cgpa
0  6.89
1  5.12
2  7.82
3  7.42
4  6.94
Training output
Code
y.head(5)
   package
0     3.26
1     1.98
2     3.25
3     3.67
4     3.57
# Creating a linear regression model
from sklearn.model_selection import train_test_split  # train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

from sklearn.linear_model import LinearRegression  # importing Linear regression
lr = LinearRegression()   # making an object of the class
lr.fit(x_train, y_train)  # fitting on training data
LinearRegression()
Our model makes some errors. Let us see which values of slope and intercept the model chose.
display(lr.intercept_)
display(lr.coef_)
array([-1.02960704])
array([[0.57633042]])
Code
plt.scatter(x=data_placement["cgpa"], y=data_placement['package'], label="data points")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] - 1.03), label="Linear Regression", color='#0f4c81')
plt.xlabel("CGPA")
plt.ylabel("Package in LPA")
plt.legend()
plt.show()
Code
ypred = lr.predict(x_test)
2.2 Intuition behind Linear Regression
y = mx + c, i.e., Package = m * CGPA + c
m is the weight, i.e., how much the package (y) depends on CGPA (x).
If the slope increases, the dependence of package (y) on CGPA (x) increases; if the slope decreases, that dependence decreases.
Now, consider data with two columns, Package and Experience, so Package = m * Experience + c. If the intercept c were 0, then at Experience = 0 the predicted Package would also be 0, i.e., freshers would get no salary. But this is not true: even freshers get some package, and that package at Experience = 0 is c. We call it the offset or intercept. In general, the regression constant (offset / intercept / c) tells us the predicted value of the dependent (output) variable when all the independent (input) variables equal 0.
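A tiny sketch of this on made-up experience/package numbers (the data below is hypothetical, purely for illustration): the fitted line's prediction at Experience = 0 is exactly the intercept c.

```python
import numpy as np

# Hypothetical data: package (in LPA) vs years of experience.
experience = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
package = np.array([3.0, 4.1, 5.2, 5.9, 7.1, 8.0])

# Fit Package = m * Experience + c by least squares.
m, c = np.polyfit(experience, package, deg=1)

# The prediction for a fresher (Experience = 0) is just the intercept c.
pred_fresher = m * 0 + c
print(round(c, 2), round(pred_fresher, 2))
```

So even at zero experience the model predicts a non-zero package, which is precisely what the intercept encodes.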
The general equation of straight line in linear regression is,
\(y_{i} = \beta_{0} + \beta_{1} * x_{i} +\epsilon\) where, \(\epsilon\) = error we make during prediction.
Now that we understand the intuition behind linear regression, let us learn how to find the values of \(\beta_{0}\) and \(\beta_{1}\). We can find them in two ways:
Closed-form solution (a direct formula, with no iterative derivative-based optimization), e.g., the Sridharacharya (quadratic) formula for solving quadratic equations, or the OLS (Ordinary Least Squares) method for solving for \(\beta_{0}\) and \(\beta_{1}\).
Non-closed-form solution (an iterative, derivative-based method), e.g., the gradient descent method for solving for \(\beta_{0}\) and \(\beta_{1}\).
The algorithm we used above, scikit-learn's LinearRegression, uses the OLS method to solve for \(\beta_{0}\) and \(\beta_{1}\); other estimators, such as SGDRegressor, use gradient descent. OLS works well for low-dimensional data; gradient descent is preferable when the data has many dimensions (many input features), where the OLS solution becomes painful.
where \(y_{i}\) = dependent values, \(i = 1, 2, 3, \ldots, n\). OLS estimates the parameters \(\beta_{0}\) (constant) and \(\beta_{1}\) (coefficient) by finding the values that minimize the sum of the squared errors of prediction, i.e., the differences between a case's actual score on the output variable and the score we predict for it using the input variables. In running the regression model, what we are trying to do is minimize the sum of the squared errors of prediction, i.e., of the \(\epsilon_{i}\) values, across all cases. Mathematically, this quantity can be expressed as:
\[ SSE = \sum_{i=1}^{n} \epsilon_{i}^{2} = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \tag{2}\]
Specifically, what we want to do is find the values of \(\beta_{0}\) and \(\beta_{1}\) that minimize the quantity in Equation 2 above. For this, we need to express SSE in terms of \(\beta_{0}\) and \(\beta_{1}\), take the derivatives of SSE with respect to \(\beta_{0}\) and \(\beta_{1}\), set these derivatives to zero, and solve for \(\beta_{0}\) and \(\beta_{1}\).
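As a sketch of what that derivation yields for simple linear regression, here are the standard closed-form estimates checked against `np.polyfit`, which solves the same least-squares problem (the toy numbers are the first rows of the placement data):

```python
import numpy as np

# OLS closed form for simple linear regression, obtained by setting
# dSSE/dbeta0 = 0 and dSSE/dbeta1 = 0 and solving:
#   beta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
#   beta0 = y_mean - beta1 * x_mean
x = np.array([6.89, 5.12, 7.82, 7.42, 6.94])
y = np.array([3.26, 1.98, 3.25, 3.67, 3.57])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# np.polyfit minimizes the same sum of squared errors, so both must agree.
m, c = np.polyfit(x, y, deg=1)
print(np.allclose([beta1, beta0], [m, c]))
```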
Our calculated parameters match the parameters of the linear regression model. Now we plot the regression line using the parameters we calculated.
Code
plt.scatter(x=data_placement["cgpa"], y=data_placement['package'], label="data points")
plt.plot(data_placement["cgpa"], (0.576 * data_placement['cgpa'] - 1.03), color='#0f4c81', label="Linear Regression")
plt.xlabel("CGPA")
plt.ylabel("Package in LPA")
plt.legend()
plt.show()
2.4 Relationship between slope and squared errors
Code
# plt.style.use('default')
plt.figure(figsize=(12, 8))
plt.scatter(x=data_placement["cgpa"], y=data_placement['package'], label="data points")
plt.plot(data_placement["cgpa"], (0.4 * data_placement['cgpa'] - 1.03), color='r', label="m = 0.4")
plt.plot(data_placement["cgpa"], (0.45 * data_placement['cgpa'] - 1.03), color='orange', label="m = 0.45")
plt.plot(data_placement["cgpa"], (0.5 * data_placement['cgpa'] - 1.03), color='c', label="m = 0.5")
plt.plot(data_placement["cgpa"], (0.55 * data_placement['cgpa'] - 1.03), color='purple', label="m = 0.55")
plt.plot(data_placement["cgpa"], (0.575 * data_placement['cgpa'] - 1.03), color='black', label="m = 0.575")
plt.plot(data_placement["cgpa"], (0.585 * data_placement['cgpa'] - 1.03), color='indigo', label="m = 0.585")
plt.plot(data_placement["cgpa"], (0.6 * data_placement['cgpa'] - 1.03), color='grey', label="m = 0.6")
plt.plot(data_placement["cgpa"], (0.65 * data_placement['cgpa'] - 1.03), color='y', label="m = 0.65")
plt.plot(data_placement["cgpa"], (0.7 * data_placement['cgpa'] - 1.03), color='violet', label="m = 0.7")
plt.plot(data_placement["cgpa"], (0.75 * data_placement['cgpa'] - 1.03), color='g', label="m = 0.75")
plt.xlabel("CGPA")
plt.ylabel("Package in LPA")
plt.title("different slope values for best fit line")
# plt.grid()
plt.legend()
plt.show()
Code
slopes = [0.4, 0.45, 0.5, 0.54, 0.575, 0.62, 0.65, 0.7, 0.75]
intercept = -1.03
errors = []
for i in slopes:
    sse = 0
    for j in range(len(data_placement['package'])):
        sse += (data_placement['package'].values[j] - (i * data_placement['cgpa'].values[j] + intercept)) ** 2
    errors.append(sse)

plt.plot(slopes, errors)
plt.xlabel("Slopes")
plt.ylabel("Sum of squared Errors")
# plt.grid()
plt.show()
Neither large nor small slopes are good; we need the optimum slope value to minimize the error. e.g., at slope = \(0.575\), the error = \(21.37\), which is the minimum.
2.5 Relationship between intercept and squared errors
Code
plt.figure(figsize=(12, 8))
plt.scatter(x=data_placement["cgpa"], y=data_placement['package'], label="data_placement points")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] - 2.5), color='r', label="c = -2.5")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] - 2), color='orange', label="c = -2")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] - 1.5), color='y', label="c = -1.5")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] - 1.03), color='black', label="c = -1.03")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] - 0.5), color='indigo', label="c = -0.5")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] + 0), color='violet', label="c = 0")
plt.plot(data_placement["cgpa"], (0.57 * data_placement['cgpa'] + 0.5), color='g', label="c = 0.5")
plt.xlabel("CGPA")
plt.ylabel("Package in LPA")
plt.title("different intercept values for best fit line")
plt.legend()
# plt.grid()
plt.show()
Code
slope = 0.575
intercepts = np.arange(-2.5, 0.6, 0.5)
errors = []
for i in intercepts:
    sse = 0
    for j in range(len(data_placement['package'])):
        sse += (data_placement['package'].values[j] - (slope * data_placement['cgpa'].values[j] + i)) ** 2
    errors.append(sse)

plt.plot(intercepts, errors)
plt.xlabel("intercepts")
plt.ylabel("Sum of Squared Errors")
# plt.grid()
plt.show()
Again, neither large nor small intercepts are good; we need the optimum intercept value to minimize the error. e.g., at intercept = \(-1.0\), the error = \(21.47\), which is the minimum.
2.6 Relationship between both intercept & slope with squared errors
Changing slope and intercept together
3 Regression Metrics
A machine-learning model should not fit the training data with 100% accuracy; a model that does is biased toward the training set, which leads to the concepts of overfitting and underfitting. Accuracy on the training data matters, but it is just as important to get genuine results on unseen data; otherwise the model is of no use. So, to build and deploy a generalized model, we need to evaluate it on different metrics, which helps us optimize performance, fine-tune the model, and obtain better results. If one metric were perfect, there would be no need for multiple metrics; but different evaluation metrics fit different datasets.
MSE (Mean Squared Error) is a widely used and very simple metric, with only a small change from mean absolute error: it takes the squared difference between actual and predicted values.
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \]
where \(y\) = actual values, \(\hat{y}\) = predicted values.
Advantages
The graph of MSE is differentiable at all values of \(x\), so you can easily use it as a loss function.
Limitations
The value you get after calculating MSE is in the squared unit of the output. For example, if the output variable is in meters (\(m\)), then the MSE is in meters squared (\(m^2\)).
If the dataset has outliers, MSE penalizes them most heavily, and the calculated MSE gets large. In short, it is not robust to outliers, which was an advantage of MAE. e.g., the red mark in the figure below.
What MSE actually does is sum up the areas of all the squares and try to minimize that total.
Error square
from sklearn.metrics import mean_squared_error

print("MSE = ", mean_squared_error(y_true=y_test, y_pred=ypred))
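The sklearn call can be cross-checked by computing the mean of squared differences directly; a tiny sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# MSE by hand: mean of the squared differences between actual and predicted.
y_true = np.array([3.0, 2.5, 4.0, 3.5])
y_hat = np.array([2.8, 2.9, 3.6, 3.9])

mse_manual = np.mean((y_true - y_hat) ** 2)
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
```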
As the name itself suggests, RMSE is simply the square root of the mean squared error: \(RMSE = \sqrt{MSE}\). RMSE is used most of the time; in deep learning, it is the most commonly used regression metric.
Advantages
The output value you get is in the same unit as the required output variable which makes interpretation of loss easy.
Limitations
It is not that robust to outliers as compared to MAE.
from sklearn.metrics import mean_squared_error

print("RMSE = ", np.sqrt(mean_squared_error(y_true=y_test, y_pred=ypred)))
\[ R^2 = 1 - \frac{SS_{Res}}{SS_{Tot}} = 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}(y_{i} - \bar{y})^{2}} \]
where \(SS_{Res}\) = residual sum of squares, \(SS_{Tot}\) = total sum of squares, \(y\) = actual values, \(\hat y\) = predicted values, \(\bar y\) = mean of \(y\).
• The R-squared score is a metric that tells you how well your model performed, not the loss in an absolute sense. It is also known as the coefficient of determination or goodness of fit.
• R-squared is a comparison of the residual sum of squares (\(SS_{Res}\)) with the total sum of squares (\(SS_{Tot}\)).
• The value of R-squared generally lies between 0 and 1. We get R-squared = 1 when the model fits the data perfectly and there is no difference between predicted and actual values. We get R-squared = 0 when the model does not capture any variability: it learns no relationship between the dependent and independent variables and just predicts the average. In that case the mean line and the regression line overlap, the model's performance is at its worst, and it makes no use of the inputs to explain the output column.
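A minimal sketch (with made-up predictions) of both points: R-squared is one minus the ratio of the two sums of squares, and a "model" that always predicts the mean scores exactly 0:

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot on toy data.
y = np.array([3.26, 1.98, 3.25, 3.67, 3.57])
y_hat = np.array([3.10, 2.20, 3.40, 3.50, 3.60])  # some model's predictions

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))

# A model that always predicts the mean has SS_res == SS_tot, so R^2 == 0.
r2_mean = 1 - np.sum((y - y.mean()) ** 2) / ss_tot
print(r2_mean)  # 0.0
```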
Que. Is it possible to have R-squared score less than zero?
Advantages
MAE and MSE depend on the context as we have seen whereas the \(R^2\) score is independent of context.
Applicable on multiple algorithms.
Limitations
It is expected that more input variables explain more of the variation of the output, i.e., more information, more accuracy. However, the value of R-squared always increases or stays the same as new variables are added to the model, without checking the significance of the newly added variable (i.e., R-squared never decreases when new attributes are added). As a result, non-significant attributes can be added to the model and still increase the R-squared value.
This is because \(SS_{Tot}\) is always constant and the regression model tries to decrease the value of \(SS_{Res}\) by finding some correlation with this new attribute hence the overall value of r-square increases, which can lead to a poor regression model.
Note: The value of R-square can also be negative when the model fitted is worse than the average fitted model.
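A quick sketch of that note, using made-up numbers: predictions that are worse than "always predict the mean" push R-squared below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# A model worse than the mean-prediction baseline yields negative R^2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_bad = np.array([4.0, 3.0, 2.0, 1.0])  # anti-correlated predictions

# SS_res = 20, SS_tot = 5, so R^2 = 1 - 20/5 = -3.
print(r2_score(y_true, y_bad))  # -3.0
```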
R-squared (\(R^2\)) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable in a regression model.
It explains the extent to which the variance of one variable explains the variance of the other variable. So, the \(R^2\) of our model is \(0.73\); that means approximately 73% of the total observed variation of Package can be explained by the model's input, i.e., cgpa. The remaining 27% of the variation is not explained by the model input, and can come from reasons that cannot be captured mathematically here, such as the interview going well, or the candidate having a referral, etc.
3.5 Adjusted R Squared Error (\(R^2_a\))
The problem with \(R^2\) is that when we add an irrelevant feature to the dataset, \(R^2\) sometimes still increases, which is misleading. Adjusted R-squared corrects for this: \[ R^2_a = 1 - \Bigg [\Bigg (\frac{n - 1}{n - k - 1}\Bigg) \times (1 - R^2) \Bigg ]\]
where, \(R^2\) = r squared error, \(n\) = Total sample size (no. of rows), \(k\) = number of predictors
Now, as \(k\) increases ↑ (by adding features), the denominator \((n - k - 1)\) decreases ↓ while \((n - 1)\) stays constant. If the added feature is irrelevant, the \(R^2\) score stays constant or increases ↑ only slightly, so the whole bracketed term increases ↑, and when we subtract it from one, the resulting \(R^2_a\) score decreases ↓. So this is what happens when we add an irrelevant feature to the dataset.
And if we add a relevant feature, the \(R^2\) score increases ↑ and \((1 - R^2)\) decreases heavily ↓↓; even though \((n - k - 1)\) also decreases ↓, the complete bracketed term decreases ↓ overall, and on subtracting it from one, the \(R^2_a\) score increases ↑.
Hence, this metric becomes one of the most important metrics to use during the evaluation of the model. This method is useful when we have multiple inputs like multiple linear regression.
# function for adjusted r squared
def r2_adj(n, k, r2):
    return 1 - ((n - 1) / (n - k - 1) * (1 - r2))

score = r2_adj(n=len(x_test), k=len(x_test.columns), r2=r2_score(y_true=y_test, y_pred=ypred))
print("R2 adj = ", np.round(score, 3))
R2 adj = 0.723
Adding an irrelevant random column to our data to check whether \(R^2_a\) decreases:
Code
new_df = data_placement.copy()
new_df["random_feature1"] = np.random.random(200)
print("New data with random feature")
new_df.head()
def r2_adj(n, k, r2):
    return 1 - ((n - 1) / (n - k - 1) * (1 - r2))

# x_test2, y_test2, y_pred2 come from a model trained on new_df (cells not shown)
adj_r2_2 = r2_adj(n=len(x_test2), k=len(x_test2.columns), r2=r2_score(y_true=y_test2, y_pred=y_pred2))
print("new_df2 R-sq adjusted = ", adj_r2_2)

# old adjusted r squared
print("old adj r squared =", score)
new_df2 R-sq adjusted = 0.8507446702503348
old adj r squared = 0.7226040784587475
Adj R squared increased…
4 Assumptions of Linear Regression
The theory of linear regression is based on certain statistical assumptions. It is crucial to check these regression assumptions before modeling the data using the linear regression approach.
Mainly there are 7 assumptions taken while using Linear Regression:
1. Linear relationship between the independent and dependent variables
The reason behind this assumption is that if the relationship is non-linear, which is often the case in real-world data, then the predictions made by our linear regression model will not be accurate and will deviate a lot from the actual observations.
2. No Multicolinearlity in the data
Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a regression model are highly correlated with each other. In other words, multicollinearity indicates a strong linear relationship among the predictor variables. This can create challenges in the regression analysis because it becomes difficult to determine the individual effects of each independent variable on the dependent variable accurately.
Multicollinearity can lead to unstable and unreliable coefficient estimates, making it harder to interpret the results and draw meaningful conclusions from the model. It is essential to detect and address multicollinearity to ensure the validity and robustness of the regression model. For example: Balram loves watching television while munching on chips. The more television he watches, the more chips he eats, and the happier he gets!
Now, if we could quantify happiness and measure Balram's happiness while he is busy doing his favorite activity, which do you think would have a greater impact on it: having chips or watching television? That is difficult to determine, because the moment we try to measure his happiness from eating chips, he starts watching television; and the moment we try to measure his happiness from watching television, he starts eating chips. Eating chips and watching television are highly correlated in Balram's case, and we cannot individually determine the impact of each activity on his happiness. This is the multicollinearity problem!
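Balram's predicament can be made quantitative with synthetic data (the numbers below are invented): when two predictors move together, their correlation, and hence the variance inflation factor (VIF), is very high.

```python
import numpy as np

# Synthetic multicollinearity: "chips eaten" is almost a linear function of
# "tv hours", so the two predictors carry nearly identical information.
rng = np.random.default_rng(0)
tv = rng.uniform(0, 5, 200)
chips = 2 * tv + rng.normal(0, 0.1, 200)

r = np.corrcoef(tv, chips)[0, 1]
vif = 1 / (1 - r**2)  # VIF for the two-predictor case
print(r > 0.99, vif > 50)
```

A common rule of thumb flags VIF values above 5 or 10 as problematic; here the VIF is in the hundreds.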
3. Homoscedasticity of Residuals or Equal Variances
Homoscedasticity means that the residuals we get from the linear regression model should be spread homogeneously, i.e., evenly. It refers to a condition in which the variance of the residual, or error term, in a regression model is constant: the error term does not vary much as the value of the predictor variable changes. A lack of homoscedasticity may suggest that the regression model needs additional predictor variables to explain the behavior of the dependent variable.
For example, suppose you wanted to explain student test scores using the amount of time each student spent studying. In this case, the test scores would be the dependent variable and the time spent studying would be the predictor variable.
The error term would show the amount of variance in the test scores that was not explained by the amount of time studying. If that variance is uniform, or homoscedastic, then that would suggest the model may be an adequate explanation for test performance—explaining it in terms of time spent studying.
But the variance may be heteroscedastic. A plot of the error term data may show a large amount of study time corresponded very closely with high test scores but that low study time test scores varied widely and even included some very high scores.
So the variance of scores would not be well-explained simply by one predictor variable—the amount of time studying. In this case, some other factor is probably at work, and the model may need to be enhanced in order to identify it or them.
Further investigation may reveal that some students had seen the answers to the test ahead of time or that they had previously taken a similar test, and therefore didn’t need to study for this particular test. To improve on the regression model, the researcher would have to try out other explanatory variables that could provide a more accurate fit to the data. If, for example, some students had seen the answers ahead of time, the regression model would then have two explanatory variables: time studying, and whether the student had prior knowledge of the answers.
With these two variables, more of the variance of the test scores would be explained and the variance of the error term might then be homoskedastic, suggesting that the model was well-defined.
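A rough way to check this assumption numerically (a sketch on synthetic data, not a formal test like Breusch-Pagan) is to compare the residual spread in different regions of the predictor:

```python
import numpy as np

# Heteroscedasticity sketch: the noise grows with x, so the residual spread
# differs between the low-x and high-x halves of the data.
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 400)
y = 3 * x + rng.normal(0, 0.3 * x)  # error standard deviation grows with x

m, c = np.polyfit(x, y, 1)
resid = y - (m * x + c)
low, high = resid[x < 5.5], resid[x >= 5.5]
print(high.std() > 1.5 * low.std())  # unequal spread -> heteroscedastic
```

For homoscedastic data the two spreads would be roughly equal; a residuals-vs-fitted scatter plot is the usual visual version of the same check.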
4. No Autocorrelation in residuals
One of the critical assumptions of multiple linear regression is that there should be no autocorrelation in the data. Autocorrelation occurs when the residuals are dependent on each other. This factor is visible in stock prices, where the price of a stock is not independent of its previous value. Another example is daily temperatures in a month, which are autocorrelated: the next day's temperature tends to rise when it has been increasing, and tends to drop when it has been decreasing, during the previous days.
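One common diagnostic for this is the Durbin-Watson statistic, which is near 2 for independent residuals and near 0 for strong positive autocorrelation. A small sketch on synthetic residuals (a white-noise series vs a drifting random walk, like the temperature example):

```python
import numpy as np

# Durbin-Watson statistic: sum of squared successive differences over the
# sum of squared residuals; ~2 if residuals are independent, ~0 if they drift.
def durbin_watson(resid):
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(2)
white = rng.normal(size=1000)            # independent residuals
walk = np.cumsum(rng.normal(size=1000))  # random walk: heavily autocorrelated

dw_white = durbin_watson(white)
dw_walk = durbin_watson(walk)
print(dw_white, dw_walk)
```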
5. Number of observations Greater than the number of predictors
For a better-performing model, the number of observations should always be greater than the number of predictors; in general, the more observations, the better the model performs. Therefore, to build a linear regression model you must have more observations than independent variables (predictors) in the dataset. The reason behind this can be understood through the curse of dimensionality.
6. Each observation is unique
It is also important to ensure that each observation is independent of the other observation. Meaning each observation in the data set should be measured separately on a unique occurrence of the event that caused the observation.
7. Predictors are distributed Normally
It is a good habit to check graphically the distributions of all variables, both dependent and independent. If some of them are slightly skewed, keep them as they are. On the other hand, highly skewed variables should be normalized before fitting the model.
After fitting the model, it is necessary to make sure that the residuals are distributed normally, to ascertain its technical correctness.
One can get an idea of the distribution of the predicted values by plotting density, KDE, or QQ plots for the predictions.
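One such check is the QQ-plot; `scipy.stats.probplot` also returns the correlation coefficient of the QQ points, which is close to 1 when the residuals are approximately normal. A sketch on synthetic residuals:

```python
import numpy as np
from scipy import stats

# QQ-based normality check: probplot fits a line through the ordered
# residuals vs theoretical normal quantiles and reports its correlation r.
rng = np.random.default_rng(3)
resid = rng.normal(0, 1, 500)  # stand-in for a model's residuals

(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print(r > 0.99)  # close to 1 for normally distributed residuals
```

Passing `plot=plt` to `probplot` would draw the QQ plot itself instead of just returning the fit.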
4.1 What is curse of dimensionality?
Ans. In machine learning, we often have high-dimensional data. If we have data with 60 different features, we are working in a space with 60 dimensions. If we are analyzing grayscale images sized 50 x 50, we are working in a space with 2,500 dimensions. If the images are RGB-colored, the dimensionality increases to 7,500 dimensions (one dimension for each color channel of each pixel in the image).
High-dimensional data is a dataset whose number of features (p) is bigger than its number of observations (N), usually written p >> N. High-dimensional data is the problem that leads to the curse of dimensionality.
The curse of dimensionality shows that as the number of features increases, the model's performance increases as well, until we reach the optimal number of features. Adding more features while keeping the training-set size fixed then degrades the model's performance.
To mitigate the problems associated with high-dimensional data, dimensionality-reduction techniques are used: 'feature selection' or 'feature extraction'.
(i) Feature selection techniques: 1. Low-variance filter: attributes with very low variance are eliminated; attributes without much variance take an almost constant value and do not contribute to the predictability of the model. 2. High-correlation filter. 3. Multicollinearity. 4. Feature ranking: decision-tree models such as CART can rank the attributes based on their importance or contribution to the predictability of the model.
(ii) Feature extraction techniques: the high-dimensional attributes are combined into low-dimensional components (PCA or ICA) or factored into low-dimensional factors (FA). e.g., PCA.
Gradient Descent is an optimization algorithm used in machine learning to minimize a function by iteratively moving toward the function's minimum. In machine learning, more often than not we try to minimize loss functions (like mean squared error); by minimizing the loss function we improve the model, and gradient descent is one of the most popular algorithms used for this purpose. In mathematics, gradient descent is an iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the direction opposite to the gradient of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient leads to a local maximum of the function; that procedure is known as gradient ascent.
Gradient Descent Algorithm
6.1 How to minimize the Loss/Cost function?
Ans. ‘Loss’ in Machine learning helps us understand the difference between the predicted value \(\hat y\) & the actual value \(y\). The Function used to quantify this loss during the training phase in the form of a single real number is known as “Loss Function”. For Linear Regression, our cost function is \[ L = \sum _{i = 0}^{n} (y_i - \hat y_i)^2 = \sum _{i = 0}^{n} (y_i - (mx_i + b))^2\]
this is basically the sum of squared differences, where \(\hat y = mx + b\). We already know \(y_i\) and \(x_i\); we just need to find the \(m\) and \(b\) for which \(L\) is minimum.
Assume for now that we somehow know the value of \(m\); to find \(b\), we use gradient descent.
Step 1. Start from any ordinary point, say \(b_{old} = -120\).
Step 2. Find the slope at \(b_{old}\) using \(\partial L/\partial b\): \(\partial L/\partial b = 2 \sum_{i = 0}^{n} (y_i - mx_i - b) \cdot (-1) = -2 \sum_{i = 0}^{n} (y_i - mx_i - b)\).
Step 3. Put \(b = b_{old}\) to get the slope at that point.
Step 4. Update: \(b_{new} = b_{old} - \eta \cdot \text{slope @ } b_{old}\).
Step 5. Repeat the process until we reach \(L_{min}\).
This equation is gradient descent: \[ b_{new} = b_{old} - \eta \cdot \text{slope @ } b_{old}\]\[ m_{new} = m_{old} - \eta \cdot \text{slope @ } m_{old}\] where \(\eta\) is the learning rate, a hyper-parameter used to scale the magnitude of the parameter updates during gradient descent. The choice of learning rate impacts two things: 1) how fast the algorithm learns, and 2) whether the cost function is minimized at all. A common default value is \(0.01\), but we can change it; without scaling the step this way, \(b_{new}\) would zigzag on every update.
6.2 When to stop in Gradient Descent?
Ans. 1. If \(|b_{new} - b_{old}| < 0.01\) (or some other small threshold), i.e., the update no longer changes \(b\) significantly, that means we have reached the minimum.
Getting accurate b in less than 25 epochs
2. We can cap the loop, i.e., run the iteration only 100 or 1000 times, and take the answer we have at the end.
Loss minimized in less than 15 epochs
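Both stopping rules can be combined in one sketch (toy data, with the slope \(m\) assumed already known, as in the derivation above):

```python
import numpy as np

# 1-D gradient descent on the intercept b, stopping either after a capped
# number of epochs or when the update becomes smaller than a tolerance.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 3.0  # true line: m = 2, b = 3
m = 2.0            # assume m is already found

b, lr, tol = -120.0, 0.01, 1e-6
for epoch in range(1000):                    # rule 2: capped epochs
    slope_b = -2 * np.sum(y - m * x - b)     # dL/db at the current b
    b_new = b - lr * slope_b
    if abs(b_new - b) < tol:                 # rule 1: update too small to matter
        break
    b = b_new

print(round(b, 3))  # converges near the true intercept 3.0
```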
6.3 Effect of learning rate?
Ans. If the learning rate is too low, like \(\eta = 0.03\), we fail to reach the required minimum in time, i.e., we lag behind. If \(\eta = 0.5\), i.e., too high, we overshoot the minimum. So the correct \(\eta\) value is essential for convergence, and what the correct \(\eta\) is depends on the data.
1. When the learning rate \(\eta\) is optimal, the model converges to the minimum easily.
2. A low \(\eta\) → more epochs are needed to reach the optimal solution, i.e., more calculation and time; the algorithm works slowly.
3. A higher \(\eta\) → it overshoots but converges later.
4. A very large \(\eta\) → it overshoots and diverges, moving away from the minimum, and performance decreases as it learns.
Learning rate
Effect of learning rates
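These four regimes can be reproduced on the same toy 1-D problem as before (synthetic data; the cutoffs in the prints are illustrative):

```python
import numpy as np

# Same toy problem: fit the intercept b for y = 2x + 3 with m = 2 known.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 3.0
m = 2.0

def run_gd(lr, epochs=50, b0=-10.0):
    b = b0
    for _ in range(epochs):
        b -= lr * (-2 * np.sum(y - m * x - b))  # dL/db
    return b

# Too-small lr is still far from b = 3 after 50 epochs; a good lr is close.
print(abs(run_gd(0.001) - 3) > abs(run_gd(0.01) - 3))
# A huge lr makes the updates grow without bound: divergence.
print(abs(run_gd(0.5)) > 1e6)
```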
Step size: gradient descent takes big steps when we are far away from the optimal sum of squared residuals, and smaller and smaller steps as it gets close to the best solution.
Steps are big at beginning and gradually decrease at the end
Reaching from worst_fit_line to best_fit_line smoothly
6.4 Applying Gradient Descent on all parameters together
Now we need to use this gradient descent for finding the optimal value for both \(m\) and \(b\)
Similarly, we can differentiate \(L\) w.r.t. \(m\): \(\partial L/\partial m = 2 \sum_{i = 0}^{n} (y_i - mx_i - b) \cdot (-x_i) = -2 \sum_{i = 0}^{n} x_i (y_i - mx_i - b)\) —– (2)
Step 1. Set initial values of \(m\) and \(b\), say \(m = 0\), \(b = 1\).
Step 2. Set initial values of the learning rate \(\eta\) and \(epochs\), say \(\eta = 0.01\), \(epochs = 100\).
where, \(\text{slope @ } m_{old}\) is calculated by putting \(m_{old}\) value in (2)
Step 3. Continue to calculate \(b_{new}\) & \(m_{new}\) until get the optimal value so that our Loss \(L\) will be minimum.
Code
X, y = make_regression(n_samples=100, n_features=1, n_informative=1, n_targets=1, noise=20, random_state=13)

m_arr = np.linspace(-150, 150, 10)
b_arr = np.linspace(-150, 150, 10)
mGrid, bGrid = np.meshgrid(m_arr, b_arr)
final = np.vstack((mGrid.ravel().reshape(1, 100), bGrid.ravel().reshape(1, 100))).T

z_arr = []
for i in range(final.shape[0]):
    z_arr.append(np.sum((y - final[i, 0] * X.reshape(100) - final[i, 1]) ** 2))
z_arr = np.array(z_arr).reshape(10, 10)

fig = go.Figure(data=[go.Surface(x=m_arr, y=b_arr, z=z_arr)])
fig.update_layout(title='Cost Function vs m and b both', autosize=False,
                  width=800, height=700, margin=dict(l=65, r=50, b=65, t=90),
                  scene=dict(xaxis_title='m', yaxis_title='b', zaxis_title='cost_function'))
fig.show()
Applying gradient descent for \(b = 150\), \(m = -127.82\), \(\eta = 0.001\) and epochs = 40
Code
b = 150; m = -127.82; lr = 0.001; epochs = 40
all_b = []
all_m = []
all_cost = []
for i in range(epochs):
    slope_b = 0; slope_m = 0; cost = 0
    for j in range(X.shape[0]):
        slope_b = slope_b - 2 * (y[j] - (m * X[j]) - b)
        slope_m = slope_m - 2 * (y[j] - (m * X[j]) - b) * X[j]
        cost = cost + (y[j] - m * X[j] - b) ** 2
    b = b - (lr * slope_b)
    m = m - (lr * slope_m)
    all_b.append(b)
    all_m.append(m)
    all_cost.append(cost)
fig = px.scatter_3d(x=np.array(all_m).ravel(), y=np.array(all_b).ravel(),
                    z=np.array(all_cost).ravel(), height=600)
fig.add_trace(go.Surface(x=m_arr, y=b_arr, z=z_arr)).update_layout(
    scene=dict(xaxis_title='m', yaxis_title='b', zaxis_title='cost_function'))
fig.show()
Code
print(f"Slope_m for Minimum loss = {all_m[-1][0]}\nintercept_b for minimum Loss = {all_b[-1][0]}")
Slope_m for Minimum loss = 27.722399754657694
intercept_b for minimum Loss = -2.2440532449559356
Both m and b changing gradually for the Best_Fit_Line
The beauty of this algorithm is that you can start from any values of \(m\) and \(b\): the squared-error cost of linear regression is convex, so gradient descent is guaranteed to converge (given a suitable learning rate).
Gradient Descent path
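A quick sketch of this convergence guarantee: two very different starting points end up at (almost) the same parameters. The synthetic data and starting values here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 80)
y = 2.5 * x - 1.0 + rng.normal(0, 0.3, 80)  # true slope 2.5, intercept -1.0

def fit(m, b, lr=0.0005, epochs=5000):
    """Plain gradient descent on the sum of squared errors."""
    for _ in range(epochs):
        residual = y - m * x - b
        b = b - lr * (-2 * np.sum(residual))       # dL/db
        m = m - lr * (-2 * np.sum(x * residual))   # dL/dm
    return m, b

# Two very different starting points...
m1, b1 = fit(m=-130.0, b=150.0)
m2, b2 = fit(m=100.0, b=-80.0)
# ...converge to the same minimum of the convex cost surface.
print(m1, b1)
print(m2, b2)
```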
6.5 Gradient Descent step-by-step
Ans.
Gradient Descent is an optimization algorithm used to minimize a model’s cost function (sometimes loss function too), which measures how well the model fits the training data.
The cost function quantifies the difference between the predicted values and actual values of the target variable. For example, in linear regression, the cost function is typically the sum of squared errors: \[L = \sum_{i=0}^{n} (y_i - \hat{y}_i)^2\]
where \(y_i\) is the actual value and \(\hat{y}_i\) is the predicted value.
The gradient of the cost function is the vector of partial derivatives with respect to the model’s parameters (such as weights and biases). It points in the direction of the steepest ascent of the cost function. For example, for linear regression, the partial derivatives would be: \[\frac{\partial L}{\partial b}, \quad \frac{\partial L}{\partial m}\]
where \(b\) is the bias and \(m\) represents the slope or weights in the model.
The algorithm starts with an initial set of parameters (such as \(b\) and \(m\)) and updates them iteratively to minimize the cost function. For example, if \(b = 150\), \(m = -130\), learning rate \(\eta = 0.001\), and number of iterations (epochs) = \(30\).
In each iteration, the gradient of the cost function with respect to each parameter is computed, and the parameters are updated in the direction opposite to the gradient (because we want to minimize the cost): \[b_{\text{new}} = b_{\text{old}} - \eta \cdot \frac{\partial L}{\partial b}, \quad m_{\text{new}} = m_{\text{old}} - \eta \cdot \frac{\partial L}{\partial m}\]
Here, \(\eta\) is the learning rate, which controls the step size.
The gradient points in the direction of the steepest ascent, so by moving in the opposite direction (negative gradient), we follow the path of steepest descent. This is why the method is called Gradient Descent.
The size of each step is controlled by the learning rate \(\eta\). If \(\eta\) is too large, the algorithm might overshoot the minimum and fail to converge. If it is too small, the algorithm converges very slowly and, for non-convex cost functions, is more likely to get stuck in a poor local minimum.
The process is repeated until the cost function converges, meaning the changes in the cost function or parameters become very small, indicating that the algorithm has found a minimum (or close to it).
Choosing the right learning rate and number of iterations is critical for the performance of gradient descent. Too few iterations can stop the algorithm before it has converged (underfitting), while too many iterations waste computation and, in some models, can contribute to overfitting.
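The interplay between learning rate and iteration count can be sketched on a toy problem by measuring how many epochs each learning rate needs before the loss stops improving. The data, rates, and tolerance below are illustrative choices:

```python
import numpy as np

# Toy noiseless data: y = 4x + 1.
x = np.linspace(0, 1, 30)
y = 4 * x + 1

def epochs_to_converge(lr, max_epochs=20000, tol=1e-10):
    """Epochs until the loss change drops below tol; None if it diverges."""
    m, b = 0.0, 0.0
    prev = np.inf
    for epoch in range(1, max_epochs + 1):
        residual = y - m * x - b
        b = b - lr * (-2 * np.sum(residual))
        m = m - lr * (-2 * np.sum(x * residual))
        loss = np.sum((y - m * x - b) ** 2)
        if not np.isfinite(loss) or loss > 1e12:
            return None        # diverged: steps overshoot worse every epoch
        if abs(prev - loss) < tol:
            return epoch       # converged
        prev = loss
    return max_epochs          # hit the iteration budget before converging

results = {lr: epochs_to_converge(lr) for lr in (0.0001, 0.001, 0.01, 0.1)}
print(results)
```

On this problem the smallest rate exhausts the whole iteration budget, the moderate rates converge (faster for the larger of the two), and the largest rate diverges.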
6.6 Challenges with Gradient Descent
Ans. While gradient descent is a powerful optimization algorithm, it can also present some challenges that can affect its performance. Some of these challenges include:
1. Learning Rate Selection: The choice of learning rate can significantly impact the performance of gradient descent. If the learning rate is too high, the algorithm may overshoot the minimum, and if it is too low, the algorithm may take too long to converge.
2. Convergence Rate: The convergence rate of gradient descent can be slow for large datasets or high-dimensional spaces, which can make the algorithm computationally expensive.
3. Saddle Points: In high-dimensional spaces, the gradient of the cost function can have saddle points, which can cause gradient descent to get stuck in a plateau instead of converging to a minimum.